The last few times we have been talking about using a regression to characterize a relationship between variables. As we saw, by making a few assumptions, we can use either linear or logistic regression to summarize not only how variables are related to one another, but also how to use that relationship to predict out-of-sample (or future) observations.
We are now going to introduce a different algorithm - the
kmeans algorithm. This specific algorithm is part of a
larger class of algorithms/models that have been designed to identify
relationships within the data. This is sometimes called “clustering” or
“segmentation”. The goal is to figure out clusters/segments of data that
are “similar” based on observable features.
The goal is to therefore try to figure out an underlying structure in our data. That is, we want to use the data to learn about which observations are more or less similar. Because I do not know what the true relationship is, what we are doing is sometimes called “Unsupervised” learning. In contrast, “supervised” learning is when we are actively bringing information in and “supervising” the characterization being done. We will see an example of this in a few lectures, but for now we are going to start with an unsupervised approach.
Efforts to characterize the relationship within data to determine which observations cluster together (or are segmented) is often an important first step for determining the empirical regularity of interest.
This is what dating sites (e.g., e-harmony) do when they try to figure out which individuals are more or less similar. This is what Facebook and Tik-Tok does when it tries to determine what to show you in your feed. This is what Netflix does when recommending your next series to watch. Personality tests and profiles are another example of this. These tools are also used in marketing to identify likely consumers and by political campaigns to figure out which voters should be targeted and perhaps even how. What they actually do is obviously more complicated, but the basic idea is very, very similar to what we are going to learn today.
One thing that will become quickly apparent is that how we measure something can have profound implications on what it means - especially if we have no theory to guide us in the organization/analysis of data. Sometimes data exploration = measurement = discovery.
It is also important to note that nothing we are doing is causal – the algorithm is silent as to why the relationships exist. It is equally important to note that the analysis is descriptive, not predictive. This is a critical point with profound implications – if you identify segments in your data and they take proactive steps using that information, the steps you take may affect how future data is clustered.
The algorithm we are going to use is one of the earliest implementations of clustering and it is very simple in what it does. There are many more complicated procedures and models, but for the purposes of illustrating the general idea (and also the generic limitations of this kind of “unsupervised learning”) it is easiest to start with what is perhaps one of the simplest clustering methods.
The procedure used by the k-means clustering algorithm consists of several steps”
K – hence the name kmeans.K
clusters in the multidimensional space (where the number of dimensions
is the number of variables).K centroids, the computer assigns
each observation to the cluster whose centroid is the closest (in terms
of Euclidian distance).So, if there are two variables, say \(X\) and \(Y\) and we are fitting 2 centroids, the computer will begin by randomly choosing a “centroid” for each cluster – which is simply a point in \((x,y)\). Say \((x_1,y_1)\) for cluster 1 and \((x_2,y_2)\) for cluster 2. Then given this choice, the computer figures out which centroid is “closest” to each data point. So for data point \(i\) that is located at \((x_i,y_i\)) the computer computes the distance from each.
Given this, the Euclidean distance to cluster 1 is simply: \[ (x_1 -x_i)^2 + (y_1 - y_i)^2 \] And the Euclidean distance to cluster 2: \[ (x_2 -x_i)^2 + (y_2 - y_i)^2 \] (Note that if we have more variables we just include them in a similar fashion.) Having calculcated the distance to each of the \(K\) centroids – here 2 – we now assign each datapoint to either cluster “1” or “2” depending on which is smaller. After doing this for every data point, we then calculate a new centroid by taking the average of all of the points in each cluster in each variable. So if there are \(n_1\) observations allocated to cluster 1 and \(n_2\) observations allocated to cluster 2 we would compute the new centroids using:
\[x_1 = \sum_i^{n_1} \frac{x_i}{n_1}\] \[y_1 = \sum_i^{n_1} \frac{y_i}{n_1}\] \[x_2 = \sum_i^{n_2} \frac{x_i}{n_2}\] \[y_2 = \sum_i^{n_2} \frac{y_i}{n_2}\]
Now, using these new values for \((x_1,y_1)\) for the centroid of cluster 1 and \((x_2,y_2)\) for the centroid of cluster 2 we reclassify all the observations to allocate each observation to the cluster with the closest centroid. We then recalculate the centroid for each cluster after this reallocation and then we iterate over these two steps until no data points change their cluster assignment.
Given this, how sensitive is this to the scale of the variables? What does that imply?
Predicting elections requires using votes that are counted to make predictions for “similar counties.” There are lots of ways to determine similarity based on past voting behavior (and demographics).
Entire books have been written that try to determine how many “Americas” there are. For example…

And many “quizzes” are produced to determine what type of voter you are (more on this later!). For example: quiz from the NY Times.
These are all products that are based on various types of clustering analyses that try to detect pattern in data.
So let’s start simple and think about the task of predicting what is going to happen in a state on Election Night. To do so we want to segment the state into different politically relevant regions so that we can track how well candidates are doing. Or, if you are working for a candidate, which counties should be targeted in get-out-the-vote efforts.
We are going to be working with two datasets. A dataset of votes cast
in Florida counties in the 2016 election
(FloridaCountyData.Rds) and also a dataset of the
percentage of votes cast for Democratic and Republican presidential
candidates in counties (or towns) for the 2004, 2008, 2012, 2016, and
2020 elections (CountyVote2004_2020.Rds).
To begin, let’s start with Florida in 2016.
library(tidyverse)
library(tidymodels)
library(plotly)
dat <- read_rds(file="https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/FloridaCountyData.Rds")
glimpse(dat)
## Rows: 67
## Columns: 15
## $ fips_code <int> 12001, 12003, 12005, 12007, 12009, 12011, 12013, 12…
## $ county_name <chr> "Alachua", "Baker", "Bay", "Bradford", "Brevard", "…
## $ eligible_voters <int> 173993, 15092, 118344, 16163, 411191, 1141360, 8620…
## $ party_stratum <int> 2, 5, 5, 5, 4, 1, 5, 5, 5, 5, 5, 5, 1, 5, 5, 3, 4, …
## $ party_stratum_name <chr> "Mod Democrat", "High Republican", "High Republican…
## $ geo_stratum_name <chr> "North/Panhandle", "North/Panhandle", "North/Panhan…
## $ Trump <int> 46834, 10294, 62194, 8913, 181848, 260951, 4655, 60…
## $ Clinton <int> 75820, 2112, 21797, 2924, 119679, 553320, 1241, 334…
## $ Johnson <int> 4059, 169, 2652, 177, 9451, 11078, 124, 1946, 1724,…
## $ Stein <int> 1507, 30, 562, 47, 2708, 5094, 25, 567, 480, 571, 7…
## $ geo_strata <fct> North/Panhandle, North/Panhandle, North/Panhandle, …
## $ Total2012 <int> 128569, 12634, 87449, 12098, 314744, 831950, 6081, …
## $ Total2016 <int> 128220, 12605, 87205, 12061, 313686, 830443, 6045, …
## $ PctTrump <dbl> 0.3652628, 0.8166601, 0.7131931, 0.7389934, 0.57971…
## $ PctClinton <dbl> 0.5913274, 0.1675526, 0.2499513, 0.2424343, 0.38152…
The first task that we face as data scientists is to determine which
variables are relevant for clustering. Put differently, which variables
define the groups we are trying to find. kmeans is a very
simple algorithm and it assumes that every included variable is equally
important for the clustering that is recovered. As a result, if you
include only garbage/irrelevant data the relationships you find will
also be garbage. The alogorithm is unsupervised in that it has no idea
which variables are more or less valuable to what you are trying to
find. It is simply trying to find how the data clusters together given
the data you have given it! It cannot evaluate the quality of the data
you provide.
As a result, when doing kmeans we often start with
simple visualization around data that we think is likely to be of
interest. To make it interactive we can again use plotly
package and include some text information in the
ggplot aesthetic. We will also clean up the
labels and override the default of scientific notation. We will also
change the name of the legend to make it descriptive and
interpretable.
gg<- dat %>%
ggplot(aes(x = Trump, y = Clinton, color = geo_strata,
text=paste(county_name))) +
geom_point(alpha = 0.3) +
scale_x_continuous(labels=comma) +
scale_y_continuous(labels=comma) +
labs(x="Number of Trump Votes",
y="Number of Clinton Votes",
title="Florida County Votes in 2016",
color = "Region")
ggplotly(gg,tooltip = "text")
So this suggests that counties with more Clinton votes tend to covary with counties with more Trump voters. Obviously. So we have discovered that there are more votes for both candidates in larger counties. So if we were to cluster based on this we would essentially find groups based on population size. Not useful.
So maybe we should look at the percentage of votes rather than the
number of votes. Let’s see. Again let’s improve the labels and axes and
make it interactive using plotly to show how the code
differs from the default syntax above.
gg <- dat %>%
ggplot(aes(PctTrump, PctClinton, color = geo_strata,
text=paste(county_name))) +
geom_point(alpha = 0.3) +
scale_y_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1)) +
labs(x="Pct of Trump Votes",
y="Pct of Clinton Votes",
title="Florida County Vote Share in 2016",
color = "Region")
ggplotly(gg,tooltip = "text")
So now we see that places with a higher percentage of support for Clinton have a lower support for Trump. That seems useful if we are interested in characterizing the political context of a county.
But those are highly correlated? Do we need both percentage that support Clinton and also the percentage that support Trump? It seems like that is the same information being “double-counted.” What if we include something like the number of eligible voters?
gg <- dat %>%
ggplot(aes(PctTrump, eligible_voters, color = geo_strata,
text=paste(county_name))) +
geom_point(alpha = 0.3) +
scale_y_continuous(label=comma, breaks=seq(0,2000000,by=125000)) +
scale_x_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1)) +
labs(x="Pct of Trump Votes",
y="Number of Eligible Voters",
main="Florida County Election Results in 2016",
color = "Region")
ggplotly(gg,tooltip = "text")
Now things get trickier. Do we want to segment based on the number of
eligible voters? Or is it more useful to focus on support for Clinton
and Trump? This decision is hugely consequential for the groups
kmeans will recover. This again highlights the role of the
data scientist – you need to make a decision and justify it
because the decision will be consequential!
To start let’s characterize counties by support for Clinton and Trump.
rawvote <- dat %>%
select(c(PctTrump,PctClinton)) %>%
drop_na()
Now we run by providing the data frame of all numeric data and the
number of clusters – here centers that we want the
algorithm to find for us.
fl.cluster1 <- kmeans(rawvote, centers=2)
Now call the object to see what we have just created and all of the objects that we can now work with.
fl.cluster1
## K-means clustering with 2 clusters of sizes 25, 42
##
## Cluster means:
## PctTrump PctClinton
## 1 0.4780951 0.4927738
## 2 0.7073845 0.2679218
##
## Clustering vector:
## [1] 1 2 2 2 1 1 2 2 2 2 2 2 1 2 2 1 1 1 2 1 2 2 2 2 2 1 2 2 1 2 2 2 1 2 2 1 1 2
## [39] 2 1 1 2 2 1 2 2 2 1 1 1 2 1 1 2 2 1 2 1 1 2 2 2 2 1 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 0.4586045 0.4019281
## (between_SS / total_SS = 65.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Alternatively, we can also call the tidy function to
produce a tibble of the overall fit:
clusters <- tidy(fl.cluster1)
clusters
## # A tibble: 2 × 5
## PctTrump PctClinton size withinss cluster
## <dbl> <dbl> <int> <dbl> <fct>
## 1 0.478 0.493 25 0.459 1
## 2 0.707 0.268 42 0.402 2
In this object you can see the mean value for each variable in each cluster – i.e., the centroid – as well as the number of observations (here counties) belonging to each cluster, and also the within sum of squares for each cluster. Recall that the centroid for each cluster is simply the average value of the variable for all counties that are assigned to that cluster. (This is the same as the \(x_1\), \(y_1\), \(x_2\), and \(y_2\) defined above.)
The values associated with withinss are the within sum
of squares for all observations in a cluster. This is the sum of the
squared distances between each data point in the cluster and the
centorid of that cluster. So if \(T_i\)
denotes the value of PctTrump for county \(i\) and \(C_i\) denotes the value of
PctClinton for county \(i\) the within sum of squares for the \(n_1\) counties that belong to cluster \(1\) is given by:
\[\sum_i^{n_1} (\bar{T}_1 - T_i)^2 + (\bar{C}_1 - C_i)^2\]
if we use \(\bar{T}_1\) to denote the mean of support for Trump in the \(n_1\) counties allocated to cluster 1 and \(\bar{C}_1\) to denote the mean support for Clinton in those \(n_1\) counties. We will return to this later.
One important thing to note is that kmeans starts the
algorithm by randomly choosing a centroid for each cluster and then
iterating until no classifications change cluster. As a result, the
clusters we identify can depend on the initial choices and there is
nothing to ensure that this results in an “optimal” in a global sense.
The classification is conditional on the initial start and the
optimization is “local” and relative to that initial choice. So make
sure you always set a seed!
SELF_TEST Do a new clustering with 4 centroids called
florida.me and look at the contests of each cluster using
tidy:
florida.me <- kmeans(rawvote, centers=4)
tidy(florida.me)
## # A tibble: 4 × 5
## PctTrump PctClinton size withinss cluster
## <dbl> <dbl> <int> <dbl> <fct>
## 1 0.549 0.421 18 0.0525 1
## 2 0.665 0.310 23 0.0557 2
## 3 0.362 0.610 9 0.0359 3
## 4 0.778 0.198 17 0.0504 4
Because kmeans is choosing random start values to start
the classification the start values will matter, especially when you are
fitting a lot of clusters to a high dimensional dataset (i.e., lots of
variables). Even when you are fitting a model with few clusters and few
variables the start values may impact the clustering that is found.
To illustrate this lets analyze the same data using the same number of clusters using a different seed value.
set.seed(13469) # set new seed value
fl.cluster2 <- kmeans(rawvote,centers=2) # new clustering
Now lets compare how the clusters found in fl.cluster1
compare to the clusters found in the new clustering
(fl.cluster2) using thetable function.
table(fl.cluster1$cluster,fl.cluster2$cluster) # compare clusters
##
## 1 2
## 1 25 0
## 2 0 42
So we can see the classification is exactly flipped. The observations are still largely clustered into the same clustering, but the labels of those clusters is changed. Even though the same information is recovered in both clusterings, the labels of the clusters has changed. This point is essential for replication!
Typically the information we are most interested in is how the
observations are clustered – i.e., the labels contained in the
cluster variable. So how do we get this information back to
our original tibble? Thankfully there is a function for that. The
augment function will add the cluster variable
from the kmeans clustering onto a tibble containing the data used in the
clustering.
dat.cluster <- augment(fl.cluster1,dat)
With this new augmented tibble – dat.cluster – we can
now visualize how well the recovered clusters correspond with the
underlying data. While this can be more challenging when we are working
with high-dimensional data (i.e.., lots of variables) in this case we
can visualize the relationship because we have used only two variables
in the clustering to characterize the political leanings of counties in
Florida.
If I want to plot the points using the cluster label, I can use the
geom_text code to include the label passed to
the ggplot aesthetic.
dat.cluster %>%
ggplot(aes(x = PctTrump, y = PctClinton, label = .cluster)) +
geom_text() +
scale_y_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1)) +
labs(title = "Florida Counties: 2016",
x = "% Trump in 2016",
y = "% Clinton in 2016")
But maybe that is too messy. The overlapping numbers is a bit distracting.
So let’s switch to colored points and add in the location of the
centroids. This is useful for reminding us of what kmeans
is actually doing. Let us pull this information from the
clusters object we created using the tidy()
function applied to our kmeans object. Recall that the
location of the centroid is just the average of every variable being
analyzed. NOTE: what happens if we replace color with
fill?
gg <- ggplot() +
geom_point(data=dat.cluster, aes(x = PctTrump, y = PctClinton, color = .cluster,
text=paste(county_name))) +
geom_point(data=clusters, aes(x = PctTrump, y = PctClinton), size = 10, shape = "+") +
labs(color = "Cluster",
title = "Florida Counties: 2016",
x = "% Trump in 2016",
y = "% Clinton in 2016") +
scale_y_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1))
ggplotly(gg,tooltip = "text")
Recall that there is no global minimization being done in this algorithm – all we are doing is starting with a randomly chosen centroid and then doing a (local) minimization given those start values. As a result, you can get different classifications with different start values. Here is a simple example that again shows the sensitivity to start values.
set.seed(42)
fl.cluster1 <- kmeans(rawvote,centers=2)
set.seed(13469)
fl.cluster2 <- kmeans(rawvote,centers=2)
table(fl.cluster1$cluster,fl.cluster2$cluster)
##
## 1 2
## 1 21 0
## 2 4 42
But we can use nstart to try multiple initial
configurations and use the one that produces the best total within sum
of squares given the number of centers being chosen. Given that we are
only classifying based on two variables let’s try 25 different start
values.
set.seed(42)
fl.cluster1 <- kmeans(rawvote,centers=2,nstart=25)
set.seed(13469)
fl.cluster2 <- kmeans(rawvote,centers=2,nstart=25)
table(fl.cluster1$cluster,fl.cluster2$cluster)
##
## 1 2
## 1 0 21
## 2 46 0
So now you can see that using multiple start values eliminates the
classification differences based on the initial start value! Now the
clusters have the same counties in each – although with different names.
What nstart is doing is having the algorithm try a bunch of
different start values and then choose the centroid that has the lowest
within sum of squares as the starting value. So while this is not doing
a search over every possible start value, it chooses the “best” start
value among the set of values it generates.
Now think about doing a kmeans for three clusters. Based
on the figure we just created, where do you think the three clusters
will be located.
SELF TEST Now implement this! What do you observe? Were you correct?
# INSERT CODE HERE
So what if we did cluster based on the number of votes cast? How
would that affect the conclusions we get? Instead of clustering based on
PctTrump and PctClinton do the clustering
using Trump and Clinton. Can you predict what
will happen before you do it?
rawvote <- dat %>%
select(c(Trump,Clinton)) %>%
drop_na()
fl.cluster1.count <- kmeans(rawvote, centers=2)
dat.cluster2 <- augment(fl.cluster1.count,dat)
clusters <- tidy(fl.cluster1.count)
clusters
## # A tibble: 2 × 5
## Trump Clinton size withinss cluster
## <dbl> <dbl> <int> <dbl> <fct>
## 1 254330. 375619. 7 161451108936. 1
## 2 47293. 31261. 60 226694181010. 2
Now graph the new clusters, labeling the county names.
gg <- ggplot() +
geom_point(data=dat.cluster2, aes(x = Trump, y = Clinton, color = .cluster,
text=paste(county_name))) +
geom_point(data=clusters, aes(x = Trump, y= Clinton), size = 10, shape = "+") +
labs(color = "Cluster",
title = "Florida Counties",
x = "Votes for Trump",
y = "Votes for Clinton") +
scale_y_continuous(label=comma) +
scale_x_continuous(label=comma)
ggplotly(gg,tooltip = "text")
And finally, how do the clusters compare to one another? If you do a table of the clusters against one another what do you observe?
table(ByPct = fl.cluster1$cluster,
ByVote= fl.cluster1.count$cluster)
## ByVote
## ByPct 1 2
## 1 7 14
## 2 0 46
The scale matters. What if we do a clustering using
eligible voters, PctTrump and
PctClinton. What do you observe for a clustering of these
variables using k=2 clusters?
rawvote3 <- dat %>%
select(c(eligible_voters,PctClinton,PctTrump))
cluster.mix <- kmeans(rawvote3,centers=2,nstart=25)
tidy(cluster.mix)
## # A tibble: 2 × 6
## eligible_voters PctClinton PctTrump size withinss cluster
## <dbl> <dbl> <dbl> <int> <dbl> <fct>
## 1 110305. 0.327 0.647 60 842735627916. 1
## 2 893950 0.564 0.407 7 483680203618. 2
Now rescale the data being fit using the scale function
to normalize the data to have mean 0 and variance 1. (Note that
scale just normalizes a data.frame object.) Why is this
important given the alogorithm being used? Now cluster the rescaled
data. How do the resulting clusters compare to the unrescaled clusters
and also our original scaling based on the percentages –
fl.cluster1?
rawvote3.scale <- scale(rawvote3)
summary(rawvote3.scale)
## eligible_voters PctClinton PctTrump
## Min. :-0.67089 Min. :-1.84762 Min. :-2.29670
## 1st Qu.:-0.62953 1st Qu.:-0.77653 1st Qu.:-0.50178
## Median :-0.35021 Median :-0.02995 Median : 0.06301
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.09304 3rd Qu.: 0.48713 3rd Qu.: 0.67141
## Max. : 4.26750 Max. : 2.42012 Max. : 1.88340
cluster.mix2 <- kmeans(rawvote3.scale,centers=2,nstart=25)
tidy(cluster.mix2)
## # A tibble: 2 × 6
## eligible_voters PctClinton PctTrump size withinss cluster
## <dbl> <dbl> <dbl> <int> <dbl> <fct>
## 1 0.984 1.21 -1.22 20 52.7 1
## 2 -0.419 -0.515 0.518 47 33.7 2
How to compare? Start with the normalized vs. unnormalized clustering (i.e., same variables but different scale).
table(Normalized = cluster.mix2$cluster,
Unnormalized = cluster.mix$cluster)
## Unnormalized
## Normalized 1 2
## 1 13 7
## 2 47 0
Now compare to the original clustering we did using vote share (i.e., different variables and different scale).
table(Normalized = cluster.mix2$cluster,
ByPct = fl.cluster1$cluster)
## ByPct
## Normalized 1 2
## 1 18 2
## 2 3 44
This means…
Perhaps we need more data. Lets get all of the county (or town) level data from 2004 up through 2020. Let’s focus on Florida again.
dat.all <- read_rds(file="https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/CountyVote2004_2020.Rds")
dat.fl <- dat.all %>%
filter(state=="FL")
For now, let us work with pct_rep_2016 and
pct_rep_2020 – but try replicating the results using a
different choice to see what happens. Note that kmeans
takes a data frame with all numeric columns so let’s start by creating a
new tibble with just numeric data and no missingness.
rawvote <- dat.fl %>%
select(c(pct_rep_2004,pct_rep_2008,pct_rep_2012,pct_rep_2016,pct_rep_2020)) %>%
drop_na()
Again we can start by visualizing the relationship. Since we can only think in 2 dimensions, let’s look at some.
rawvote %>%
ggplot(aes(x=pct_rep_2016, y=pct_rep_2020)) +
geom_point() +
labs(x="% Trump 2016",
y = "% Trump 2020",
title = "Trump Support in Florida Counties: 2016 & 2020") +
geom_abline(intercept=0,slope=1) +
scale_x_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1)) +
scale_y_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1))
What if we compare Republican vote share in 2004 and 2020. What does that show?
rawvote %>%
ggplot(aes(x=pct_rep_2004, y=pct_rep_2020)) +
geom_point() +
labs(x="% Republican 2004",
y = "% Republican 2020",
title = "Republican Support in Florida Counties: 2004 & 2020") +
geom_abline(intercept=0,slope=1) +
scale_x_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1)) +
scale_y_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1))
So a critical question is always – how many clusters should I use? An issue with answering this question is that there really isn’t a statistical theory to guide this determination. More clusters will always “explain” more variation, and if we choose the number of clusters equal to the number of data points we will perfectly “fit/explain” the data. But it will be a trivial explanation and not give us any real information. Recall that one of the goals is to use the clustering to reduce the dimensionality of the data in a way that recovers a “meaningful” representation of the underlying data.
So let us explore how the clustering changes for different numbers of
centers. What we are going to do is to create a tibble called
kcluster.fl that is going contain the results of a
kmeans clustering for 10 different choices of
K that varies from 1 to 10.
kcluster.fl <-
tibble(K = 1:10) %>% # define a sequence that we will use to denote k
mutate( # now we are going to create new variables in this tibble
kcluster = map(K, ~kmeans(rawvote, .x, nstart = 25)), # run a kmeans clustering using k
tidysummary = map(kcluster, tidy), # run the tidy() function on the kcluster object
augmented = map(kcluster, augment, rawvote), # save the cluster to the data
)
The above code uses the map function which is how we can
apply a function across an index. So in line 419 we are going to take
map to apply the sequence of K values we
defined running from 1 to 10 the kmeans() algorithm applied
to the rawvotes tibble. The .x in the line
reveals where we are going to substitute the value being mapped. So the
object kcluster is going to be a list of 10 elements – each
list element being a kmeans object associated with the
choice of K centers.
We then map the tidy function to the list of
kcluster we just created – creating a list called
tidysummary where each element is the summary associated
with the kth clustering. We next create a tibble that augments the
original data being clustered – rawvotes with the cluster
label associated with the kth clustering. (But remember that the meaning
of those labels is not fixed!)
So let’s give in to see what we have just done. If we take a look at
the first row of kcluster.fl we can see that it consists of
a vector of the sequence of k’s we defined, and then the three list
objects we created – kcluster, tidysummary,
and augmented.
kcluster.fl[1,]
## # A tibble: 1 × 4
## K kcluster tidysummary augmented
## <int> <list> <list> <list>
## 1 1 <kmeans> <tibble [1 × 8]> <tibble [67 × 6]>
But to work with these objects we need to extract these lists. Lists
are a pain in R – especially when you are starting out – so do not think
too hard about the following. What we are going to essentially do is to
extract each of the list objects in kcluster.fl to a
separate tibble.
clusters <- kcluster.fl %>%
unnest(cols=c(tidysummary))
clusters
## # A tibble: 55 × 11
## K kcluster pct_rep_2004 pct_rep_2008 pct_rep_2012 pct_rep_2016
## <int> <list> <dbl> <dbl> <dbl> <dbl>
## 1 1 <kmeans> 0.595 0.581 0.595 0.620
## 2 2 <kmeans> 0.674 0.680 0.695 0.730
## 3 2 <kmeans> 0.519 0.484 0.499 0.514
## 4 3 <kmeans> 0.580 0.562 0.584 0.620
## 5 3 <kmeans> 0.462 0.420 0.425 0.416
## 6 3 <kmeans> 0.707 0.716 0.727 0.759
## 7 4 <kmeans> 0.531 0.495 0.511 0.533
## 8 4 <kmeans> 0.416 0.374 0.371 0.351
## 9 4 <kmeans> 0.712 0.724 0.734 0.765
## 10 4 <kmeans> 0.601 0.588 0.611 0.648
## # ℹ 45 more rows
## # ℹ 5 more variables: pct_rep_2020 <dbl>, size <int>, withinss <dbl>,
## # cluster <fct>, augmented <list>
So clusters is a tibble that consists of the centroids associated
with each of the centroids in each of the K clusterings we
did.
We can also extract the clusters associated with each of the
observations by doing a similar operation on the augmented
list we created.
points <- kcluster.fl %>%
unnest(cols=c(augmented))
points
## # A tibble: 670 × 9
## K kcluster tidysummary pct_rep_2004 pct_rep_2008 pct_rep_2012
## <int> <list> <list> <dbl> <dbl> <dbl>
## 1 1 <kmeans> <tibble [1 × 8]> 0.429 0.386 0.405
## 2 1 <kmeans> <tibble [1 × 8]> 0.777 0.784 0.789
## 3 1 <kmeans> <tibble [1 × 8]> 0.712 0.699 0.712
## 4 1 <kmeans> <tibble [1 × 8]> 0.696 0.697 0.706
## 5 1 <kmeans> <tibble [1 × 8]> 0.577 0.547 0.558
## 6 1 <kmeans> <tibble [1 × 8]> 0.346 0.324 0.323
## 7 1 <kmeans> <tibble [1 × 8]> 0.634 0.696 0.710
## 8 1 <kmeans> <tibble [1 × 8]> 0.557 0.531 0.567
## 9 1 <kmeans> <tibble [1 × 8]> 0.569 0.574 0.604
## 10 1 <kmeans> <tibble [1 × 8]> 0.762 0.711 0.725
## # ℹ 660 more rows
## # ℹ 3 more variables: pct_rep_2016 <dbl>, pct_rep_2020 <dbl>, .cluster <fct>
So now we can plot the results. To do so we are going to produce
multiple plots by “facet-wrapping” using values K.
p1 <-
ggplot(points, aes(x = pct_rep_2004, y = pct_rep_2020)) +
geom_point(aes(color = .cluster), alpha = 0.8) +
labs(x = "% Republican Vote 2004",
y = "% Republican Vote 2020",
color = "Cluster",
title = "Clusters for Various Choices of K") +
facet_wrap(~ K) +
scale_x_continuous(limits = c(.25,1),labels = scales::percent_format(accuracy = 1)) +
scale_y_continuous(limits = c(.25,1),labels = scales::percent_format(accuracy = 1))
p1
Now let’s add in the centroids!
p1 + geom_point(data = clusters, size = 4, shape = "+")
How does Total Within Sum of squares change as clusters increase?
Recall that the within sum of squares for each cluster is simply how far
each data point in a cluster is from the centroid according to squared
Euclidean distance. Thus, if \(T\)
denotes PctTrump and \(C\)
denotes PctClinton the within sum of squares for cluster
\(k\) using the \(n\) counties that are allocated in cluster
\(k\) is given by:
\[WSS_k=\sum_i^n (\bar{T}_k - T_i)^2 + (\bar{C}_k - C_i)^2\]
Given this, the total within sum of squares is simply the sum of the within sum of squares across the k clusters. In other words, if we fit \(K\) clusters, the total within sum of squares is:
\[ TSS = \sum_k^K WSS_k\]
Note that the total sum of squares will usually decrease as the number of clusters increase, Why? Because more clusters means more centroids which will mean smaller squared distances. If, for example, we fit a model with as many centroids as there are observations – i.e., \(K==N\) – then the within sum of squares for every observation would be \(0\) and the total sum of squares would also be \(0\)! Note that the total sum of squares will not always decrease depending on number of clusters because of the dependence on start values. Especially when analyzing many variables, the results become more sensitive it is to start values!
Too see what we have created, let us take a look within the tibble
kcluster.fl and extract the second list item – which is the
set of tibbles summarizing the overall fit.
fits <- kcluster.fl[[2]]
To extract the total within sum-of-squares from this we can write a
loop to extract the information. Note that we are using
[[]] to select an element from a list. Then we can plot the
relationship. For simplicity I have used the standard plot command in
base R rather than
tot.withinss <- NULL
for(i in 1:10){
tot.withinss[i] <- fits[[i]]$tot.withinss
}
fit <- bind_cols(k = seq(1,10), tot.withinss = tot.withinss)
ggplot(fit, aes(x=k,y=tot.withinss)) +
geom_line() +
scale_x_continuous(breaks=seq(1,10)) +
labs(x="Number of Clusters", y ="Total Within Sum of Squares")
OK, so how do we interpret what this means? Or label the clusters sensibly? This again requires the data scientist to examine and mutate the data.
set.seed(13469)
fl.cluster <- kmeans(rawvote, centers=5, nstart = 25)
dat.cluster <- augment(fl.cluster,dat.fl)
tidy(fl.cluster) %>%
arrange(-pct_rep_2020)
## # A tibble: 5 × 8
## pct_rep_2004 pct_rep_2008 pct_rep_2012 pct_rep_2016 pct_rep_2020 size
## <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.714 0.727 0.737 0.768 0.779 19
## 2 0.619 0.617 0.638 0.678 0.692 14
## 3 0.570 0.541 0.562 0.596 0.607 17
## 4 0.513 0.475 0.493 0.503 0.510 9
## 5 0.416 0.374 0.371 0.351 0.384 8
## # ℹ 2 more variables: withinss <dbl>, cluster <fct>
Now change the order of the factor so that it is ordered from most
Trump supporting to most Clinton Supporting. To do so we need to use the
factor function to re-define the order of the levels.
dat.cluster <- dat.cluster %>%
mutate(cluster = factor(.cluster,
levels=c(3,5,2,4,1)))
Now let’s check that we did this correctly. Let’s see if the clusters are arranged by average Trump support.
dat.cluster %>%
group_by(cluster) %>%
summarize(PctTrump = mean(pct_rep_2020))
## # A tibble: 5 × 2
## cluster PctTrump
## <fct> <dbl>
## 1 3 0.779
## 2 5 0.692
## 3 2 0.607
## 4 4 0.510
## 5 1 0.384
Yes, but the labels are weird and unintuitive. Let’s fix this by
using the factor function to change the labels
associated with each factor value.
dat.cluster <- dat.cluster %>%
mutate(cluster = factor(cluster,
labels=c("Very Strong Rep","Strong Rep","Rep","Toss Up","Strong Dem")))
Now confirm that we did not screw that up.
dat.cluster %>%
group_by(cluster) %>%
summarize(PctTrump = mean(pct_rep_2020))
## # A tibble: 5 × 2
## cluster PctTrump
## <fct> <dbl>
## 1 Very Strong Rep 0.779
## 2 Strong Rep 0.692
## 3 Rep 0.607
## 4 Toss Up 0.510
## 5 Strong Dem 0.384
This seems good to go.
fl.centers <- as.data.frame(fl.cluster$centers)
gg <- ggplot() +
geom_point(data=dat.cluster, aes(x = pct_rep_2020, y = pct_rep_2004, color = cluster,
text=paste(county.name)), alpha = 0.8) +
geom_point(data=fl.centers, aes(x = pct_rep_2020, y = pct_rep_2004), size = 6, shape = "+") +
labs(color = "Cluster",
title = "Florida Counties",
x = "Percentage Vote for Trump",
y = "Percentage Vote for Clinton") +
scale_y_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1))
ggplotly(gg,tooltip = "text")
So how well does our classification compare to the one that was used by the networks on Election Night?
table(dat.cluster$cluster,dat.cluster$party.strata)
##
## 1--High Democrat 2--Mod Democrat 3--Middle 4--Mod Republican
## Very Strong Rep 0 0 0 0
## Strong Rep 0 0 0 0
## Rep 0 0 0 9
## Toss Up 0 0 7 2
## Strong Dem 3 5 0 0
##
## 5--High Republican
## Very Strong Rep 19
## Strong Rep 14
## Rep 8
## Toss Up 0
## Strong Dem 0
Interesting….
It is often also useful to summarize the distribution of key variables by the clusters we have found to try to interpret their meaning. Here we can use a boxplot to describe how the county clusters vary in terms of the average support for President Trump.
dat.cluster %>%
ggplot(aes(x=cluster, y=pct_rep_2020)) +
geom_boxplot() +
labs(x="Cluster",
y="Pct Trump",
title="Support for Trump Across Counties in FL")
We could also merge in county-level demographic data (using Census
fips_code) and see how things change if we cluster counties
based on their demographic features. But the important thing to remember
is that because this is an unsupervised method there is no way to
determine if the clustering is what you want it to be. Also recall that
the scale matters. The computer will always find the number of clusters
you ask for, but whether those clusters mean anything is up to you, the
data analyst to determine. This is where critical thinking is essential
– what variables are appropriate to include? And how do you interpret
the meaning of the clusters given the distribution of data within those
clusters?
What if we looked at all counties? Note we are going to drop some
states that do not record vote by counties (e.g., Maine) as well as
others for which we are lacking data for some years (e.g., Alaska).
Let’s create a tibble containing just the data called
rawdata and drop all missing data.
Do we need to standardize? Why or why not?
rawvote <- dat.all %>%
select(c(pct_rep_2004,pct_rep_2008,pct_rep_2012,pct_rep_2016,pct_rep_2020)) %>%
drop_na()
What if we compare Republican vote share in 2004 and 2020. What does that show? Let’s plot and see.
rawvote %>%
ggplot(aes(x=pct_rep_2004, y=pct_rep_2020)) +
geom_point() +
labs(x="% Republican 2004",
y = "% Republican 2020",
title = "Republican Support in Counties: 2004 & 2020") +
geom_abline(intercept=0,slope=1) +
scale_x_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1)) +
scale_y_continuous(limits = c(0,1),labels = scales::percent_format(accuracy = 1))
We begin by setting a seed to ensure replicability and then we fit
K different clusters – one for each choice of
K.
set.seed(42)
kcluster.us <-
tibble(K = 1:10) %>% # define a sequence that we will use to denote k
mutate( # now we are going to create new variables in this tibble
kcluster = map(K, ~kmeans(rawvote, .x, iter.max = 100)), # run a kmeans clustering using k
tidysummary = map(kcluster, tidy), # run the tidy() function on the kcluster object
augmented = map(kcluster, augment, rawvote) # save the cluster to the data
)
To plot this we want to extract the data points from
kcluster.us using the unnest function to the
tibble points.us and we want to extract the centroids of
the clusters from the tidysummary for each cluster into the new tibble
clusters.us by unnesting the summaries of each
fit.
points.us <- kcluster.us %>%
unnest(cols=c(augmented))
clusters.us <- kcluster.us %>%
unnest(cols=c(tidysummary))
Now we can use these two new tibbles to plot.
points.us %>%
ggplot(aes(x = pct_rep_2004, y = pct_rep_2020)) +
geom_point(aes(color = .cluster), alpha = 0.8) +
labs(x = "% Republican Vote 2004",
y = "% Republican Vote 2020",
color = "Cluster",
title = "Clusters for Various Choices of K") +
facet_wrap(~ K) +
scale_x_continuous(limits = c(.25,1),labels = scales::percent_format(accuracy = 1)) +
scale_y_continuous(limits = c(.25,1),labels = scales::percent_format(accuracy = 1)) +
geom_point(data = clusters.us, size = 4, shape = "+")
So how many clusters? Here we can see what the total within sum of squares is for each set of clusters that we find for the various choices of k. To determine how many, we want to choose a value of k such that there is very little change from adding additional clusters. Note that more clusters will always do better, so the issue is to find out the point at which the improvement seems small. This is a judgment call.
So let’s extract the fits from the kcluster.us list and
then loop over the k different fits to extract the total
within sum of squares (tot.withinss) and then create a new
tibble fit that contains the vector of cluster sizes and
vector of total within sum of squares that we used the loop to extract
(i.e., tot.withinss).
fits <- kcluster.us[[2]]
tot.withinss <- NULL
for(i in 1:10){
tot.withinss[i] <- fits[[i]]$tot.withinss
}
fit <- bind_cols(k = seq(1,10), tot.withinss = tot.withinss)
Now plot to see where the line “bends”.
fit %>%
ggplot(aes(x=k,y=tot.withinss)) +
geom_line() +
scale_x_continuous(breaks=seq(1,10)) +
labs(x="Number of Clusters",
y ="Total Within Sum of Squares",
title = "Fit by Number of Clusters")
So it seems like there are 4 or maybe 5 clusters that seem relevant?
Lots of press attention (and academic/commercial research) to social media communication. Trying to characterize (in the hopes of predicting) behavior and “sentiment.” Many are trying to do this at scale – analyzing the behavior of large numbers of individual users – including using clustering methods to uncover different “types” of users that can be used to understand and predict human opinion and behavior across a wide range of activities and concepts.
Perhaps no individual has been more scrutinized than former President Trump – who was arguably defined by his use of twitter as a medium of communication, agenda-setting, and mobilization. Multiple news stories and academic papers have focused on the corpus of Trump Tweets.
For example, the NY Times did an expose here whereby they had reporters read every single tweet and classify it. (See the methodology here. Remarkably, this was work that was done by hand. Hopefully by the end of class you can see how you could do something similar using R! More similar to what we are going to do is the work by fivethirtyeight.
We used to be able to read in Trump data via the Twitter API, but since that has been deactivated we are going to use data that was collected and posted here.
Note that you could also look at news coverage using data that people have collected on the scrolling chryons at the bottom of the screen here.
As an example of a recent publication in Nature: Humanities and Social Sciences Communications using sentiment analysis see: Sentiments and emotions evoked by news headlines of coronavirus disease (COVID-19) outbreak.
Let’s load the packages we are going to use into memory. You may need to install some of these.
library(readr)
library(tidyverse)
library(lubridate)
library(scales)
Just to give you a sense of the preprocessing, here I read in a csv file and did some manipulations
trumptweets <- read_csv("https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/trump_tweets.csv")
#View(trumptweets)
glimpse(trumptweets)
tweets <- trumptweets %>%
select(id, text, date, retweets, favorites) %>%
mutate(date = with_tz(date, "EST"),
Tweeting.hour = format(date, format = "%H"),
Tweeting.date = format(date, format = "%Y-%m-%d"),
Tweeting.date = as.Date(Tweeting.date),
Tweeting.year = as.factor(format(date, format = "%Y")))
First thing we always do is to see what we have so we know what we
are working with and what we have to do to answer the questions we want
to answer. Then we select the variables we want and practice creating
some time objects for future use. (We are using the
lubridate package for the date manipulations.) Note that
date had the time of the tweet in UTC so we used the
with_tz function from lubridate package to
change the time zone to Eastern Standard Time (as that is where
Washington, DC, Bedminster, NJ, and Mar Lago, FL are located) – note
that dates are also changed if the timezone change crosses days (a
benefit of saving the object as a date object!).
Because our data is larger than usual, we want to keep an eye on how
much is loaded into memory. Since we no longer need
trumptweets let us remove it via rm().
rm(trumptweets)
Let’s focus on asking the question of how Trump’s Twitter behavior changed over time. This is a broad question, so let’s break it up into a few specific questions to tackle using the tools we have talked about thus far to help demonstrate that based on the concepts and code we have talked about in class you are able to do quite a bit.
All of this can be done at the level of the tweet using tools we have previously used and discussed in class – i.e., no text analysis required. So we will start there. Sometimes simple analyses will get you what you need and complexity for the sake of complexity is not a virtue.
After using the data – and conditional means – to answer these descriptive questions we can then pivot to analyze what was being tweeted about using several tools that will get into the analysis of the content of the text being tweeted.
kmeans)Note that we are going to compare pre-post presidency but you have the tools, and also the code based on what we do today, to do much finer-grained analyses. You can compare how the behavior changes each year. Or each month? Or in response to his impeachments. Or how his activity in the 2016 campaign compares to his activity in his 2020 campaign. Or dive into the 2016 campaign and see how he acted and reacted during his rise to the Republican nomination. There are many, many fascinating (and largely unanswered) questions that you should be empowered to be able to do based on what we cover! We will dive deeper a few times, but largely to illustrate the amazing stuff that you can do.
So let’s get to it!
tweets <- read_rds(file="https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/Trumptweets.Rds")
So let’s start by describing our data graphically. How frequently was President Trump tweeting throughout the time series we possess?
tweets %>%
group_by(Tweeting.date) %>%
count() %>%
ggplot() +
geom_point(aes(x=Tweeting.date,y=n),alpha=.4) +
labs(x="Date",y="Number of Tweets",title="Tweeting by Trump")
Here each point is the number of tweets in a day. Some days there was clearly a lot of activity. Moreover we can see that Trump was more active before becoming President and his frequency of tweeing increased over his presidency.
We can also consider how did the number of retweets, on average,
change across days on which a tweet occurred? (Here we are using the
scales library to set the scale_y_continuous
to label numbers with commas (label=comma).)
tweets %>%
group_by(Tweeting.date) %>%
summarize(AvgRetweet = mean(retweets)) %>%
ggplot() +
geom_point(aes(x=Tweeting.date,y=AvgRetweet),alpha=.4) +
labs(x="Date",y="Average Retweets",title="Tweeting by Trump") +
scale_y_continuous(label=comma)
Clearly there is a lot of variation here. Which tweets were re-tweeted the most?
tweets %>%
select(content,retweets) %>%
top_n(retweets,n=10) %>%
arrange(-retweets)
## # A tibble: 10 × 2
## content retweets
## <chr> <dbl>
## 1 "Tonight, @FLOTUS and I tested positive for COVID-19. We will begin… 408866
## 2 "#FraudNewsCNN #FNN https://t.co/WYUnHjjUjg" 293109
## 3 "TODAY WE MAKE AMERICA GREAT AGAIN!" 281289
## 4 "Are you allowed to impeach a president for gross incompetence?" 237674
## 5 "RT @SpaceX: Liftoff! https://t.co/DRBfdUM7JA" 235250
## 6 "A$AP Rocky released from prison and on his way home to the United … 226235
## 7 "\"Why would Kim Jong-un insult me by calling me \"\"old,\"\" when … 217199
## 8 "Be prepared, there is a small chance that our horrendous leadershi… 211345
## 9 "The United States of America will be designating ANTIFA as a Terro… 210615
## 10 "RT @realDonaldTrump: The United States of America will be designat… 210614
Now can you do the same to identify the tweets that were selected as a favorite? How does this list compare to the tweets that were most likely to be retweeted? Can you write this code?
# INSERT CODE HERE
In general, how should we measure influence/impact? Favorites or retweets? Does the choice matter? Let’s look at the relationship to see how consequential the determination is.
tweets %>%
ggplot(aes(x=retweets, y=favorites)) +
geom_point() +
scale_x_continuous(label=comma) +
scale_y_continuous(label=comma) +
labs(x= "Number of Retweets", y = "Number of Times Favorited",title="Trump Tweets: Relationship between Retweets and Favorites")
In general they seem pretty related, but there are a handful of tweets that are retweeted far more than they are favorited. (On your own, can you figure out which ones these are?)
We can also use plotly to create an HTML object with
each point denoting how many retweets each tweet receieved and the date
of the tweet. We can use this to label the tweets to get a better sense
of what tweets were most re-tweeted? (This will be a very large plot
given the number of tweets and the length of the content being pasted
into the object! To keep things manageable let’s focus on the top
75th-percentile of tweets.)
library(plotly)
gg <- tweets %>%
filter(retweets > quantile(retweets,.75)) %>%
ggplot(aes(x=Tweeting.date,y=retweets,text=paste(content))) +
geom_point(alpha=.4) +
labs(x="Date",y="Retweets",title="Tweeting by Trump") +
scale_y_continuous(label=comma)
ggplotly(gg,tooltip = "text")
On your own, can you do the same for the most favorited tweets?
# INSERT CODE HERE
In addition to looking at the change over time we can also look at
the hour at which Trump was tweeting using the hour
variable we created. To start let’s do the total number of tweets at
each hour across the entire corpus.
tweets %>%
group_by(Tweeting.hour) %>%
count() %>%
ggplot() +
geom_point(aes(x=as.numeric(Tweeting.hour),y=n),alpha=.4) +
labs(x="Hour",y="Number of Tweets",title="Tweeting by Trump: Hour of Day (EST)") +
scale_x_continuous(n.breaks = 13, limits = c(0,23))
Did Trump’s use to twitter change after he was elected President? Certainly we would think that the content might change – as well as how likely it was to be favorited and retweeted – but how about the frequency and timing of the tweets?
Let’s create an indicator variable PostPresident using
the date variable to define whether the date is before
(FALSE) or after (TRUE) his election as
president (we could also use the inauguration date, but some claimed
that his behavior would change once he was officially projected.)
tweets %>%
mutate(PostPresident = Tweeting.date > as.Date("2016-11-03")) %>%
group_by(PostPresident,Tweeting.hour) %>%
count() %>%
ggplot() +
geom_point(aes(x=as.numeric(Tweeting.hour),y=n,color=PostPresident),alpha=.4) +
labs(x="Hour",y="Number of Tweets",title="Tweeting by Trump: Hour of Day (EST)",color="Is President?") +
scale_x_continuous(n.breaks = 13, limits = c(0,23))
What do you observe?
But is it right to use the overall frequency? Do we prefer the proportion of tweets that were set at each hour pre/post presidency?
Let’s use mutate to compute the proportion of tweets
that occur at each hour in each period and then plot those using
color to denote the pre/post time period.
tweets %>%
mutate(PostPresident = Tweeting.date > as.Date("2016-11-03")) %>%
group_by(PostPresident,Tweeting.hour) %>%
count() %>%
ungroup(Tweeting.hour) %>%
mutate(Prop = n/sum(n)) %>%
ggplot() +
geom_point(aes(x=as.numeric(Tweeting.hour),y=Prop,color=PostPresident),alpha=.4) +
labs(x="Hour (EST)",y="Percentage of Tweets in Period",title="Tweeting by Trump: Hour of Day (EST)",color="Is President?") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(n.breaks = 13, limits = c(0,23))
Hmm. A very different characterization! Always think about what the right calculation is. R will do what you tell it – usually ;) – but you need to think about what you are asking it to do!
We could also go deeper and look at variation by year. Here is a graph of the overall frequency. (Can you do the same for proportions?)
tweets %>%
group_by(Tweeting.year,Tweeting.hour) %>%
count() %>%
ggplot() +
geom_point(aes(x=as.numeric(Tweeting.hour),y=n,color=Tweeting.year),alpha=.4) +
labs(x="Hour",y="Number of Tweets",title="Tweeting by Trump: Hour of Day (EST)",color="Year") +
scale_x_continuous(n.breaks = 13, limits = c(0,23))
Can you graph the proportion of tweets per hour per year? How does that change your characterization. In you opinion, which is the more appropriate measure? The number of tweets or the proportion of tweets? Why or why not?
tweets %>%
group_by(Tweeting.year,Tweeting.hour) %>%
count() %>%
ungroup(Tweeting.hour) %>%
mutate(Prop = n/sum(n)) %>%
ggplot() +
geom_point(aes(x=as.numeric(Tweeting.hour),y=Prop,color=Tweeting.year),alpha=.4) +
labs(x="Hour (EST)",y="Percentage of Tweets in Period",title="Tweeting by Trump: Hour of Day (EST)",color="Year") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(n.breaks = 13, limits = c(0,23))
Here we put everything on the same graph, but maybe it makes sense to
create separate graphs - one for each value that we are using to define
the color. To do this we just need to use a
facet_wrap to create a buynch of graphs that will “wrap
around” the screen starting from the lowest value of the facet defined
after the ~ and arranged in a grid with nrow=4
(Try different values here!). We have defined
scales = "free_y" to let the graphs vary in the scales of
the y-axis (because the frequencies vary). We could also choose
"fixed" to give every graph the same x and y limits, or
"free_x" to use the same y-axis scale and allow the x-axis
to vary. scale="free" allows both the x and y axes to vary.
Experiment with what happens if you change the scale. Why did I do what
we did?
tweets %>%
group_by(Tweeting.year,Tweeting.hour) %>%
count() %>%
ggplot() +
facet_wrap(~ Tweeting.year, scales = "free_y", nrow = 4) +
geom_point(aes(x=as.numeric(Tweeting.hour),y=n,color=Tweeting.year),alpha=.4) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size=5)) +
labs(x="Hour",y="Number of Tweets",title="Tweeting by Trump: Hour of Day (UTC)",color="Year") +
scale_x_continuous(n.breaks = 13, limits = c(0,23))
First try to do the same thing with the tweeting behavior Pre/Post
Presidency. Can you use the facet_wrap to create that
visualization? Which visualization do you prefer?
# INSERT CODE HERE
Now see if you can pull this together by plotting the time of
tweeting by year but graphing the proportions this time. How should we
define the scales in this case?
# INSERT CODE HERE
You can skip this section and just load in the data in the next
section, but this is if you want to know what I did to get the data from
above ready for the analyses we do below. It involves working with the
string variable content and deleting elements of that
string to leave us only word “tokens.”
OK so that is what we can do at the level of the tweet. To analyze the content of the tweet we need to transform the string of each tweet into “tokens” (words) that we can then analyze. The first part is often the hardest – data-wrangling is typically 85% of the issue in data science. Rather than give you the cleaned data, here is a sense of what you need to do to make it work. Do not worry about understanding the content at this point.
The following is included for the interested student to get a sense
of what is required to convert the content into tokens.
Recall that our “data” looks like this:
tweets$content[1]
## [1] "Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!"
And we need to transform that into something we can analyse! This takes some work…
library(qdapRegex)
library(tm)
library(tidytext)
library(SnowballC)
First we are going to strip out all of the url and twitter-formatted url from the tweet text using some pre-defined functions.
tweets <- tweets %>%
mutate(content = rm_twitter_url(content),
content = rm_url(content),
document = id)
Now we are going to write a function that takes as an argument a
string (x) and then uses multiple functions to remove
strings satisfying certain conditions.
First we are going to process the string content to
remove combinations of letters/numbers that are not words. To do so we
are going to define a function called clean_tweets and then
apply it to the content variable in tweets
tibble.
clean_tweets <- function(x) {
x %>%
# Remove mentions e.g. "@my_account"
str_remove_all("@[[:alnum:]_]{4,}") %>%
# Remove mentions e.g. "@ my_account"
str_remove_all("@ [[:alnum:]_]{4,}") %>%
# Remove hashtags
str_remove_all("#[[:alnum:]_]+") %>%
# Remove hashtags
str_remove_all("# [[:alnum:]_]+") %>%
# Remove twitter references
str_remove_all("twitter[[:alnum:]_]+") %>%
# Remove twitter pics references
str_remove_all("pictwitter[[:alnum:]_]+") %>%
# Replace "&" character reference with "and"
str_replace_all("&", "and") %>%
# Remove punctuation, using a standard character class
str_remove_all("[[:punct:]]") %>%
# Remove "RT: " from beginning of retweets
str_remove_all("^RT:? ") %>%
# Replace any newline characters with a space
str_replace_all("\\\n", " ") %>%
# Make everything lowercase
str_to_lower() %>%
# Remove any trailing whitespace around the text
str_trim("both")
}
# Now apply this function to the `content` of `tweets`
tweets$content <- clean_tweets(tweets$content)
Now that we have pre-processed the content string lets
do some more more wrangling to extract word “tokens” from this string
and then also remove the tokens that are not meaningful words. Let’s
also define the object reg containing all the various
characters that an be used.
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- tweets %>%
filter(!str_detect(content, '^"')) %>%
unnest_tokens(word, content, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,str_detect(word, "[a-z]")) %>%
mutate(word = str_replace_all(word, "\\d+", "")) %>% # drop numbers
mutate(word = str_replace_all(word, "twitter[A-Za-z\\d]+|&", "")) %>%
mutate(word = str_replace_all(word, "pictwitter[A-Za-z\\d]+|&", "")) %>%
mutate(word = str_replace_all(word, "pic[A-Za-z\\d]+|&", "")) %>%
mutate(word = str_replace_all(word, "pic", "")) %>%
mutate(word = str_replace_all(word, "againpic[A-Za-z\\d]+|&", "")) %>%
# mutate(word = wordStem(word)) %>%
mutate(document = id) %>%
select(-id) %>%
filter(word != "") # drop any empty strings
Now use the anti_join to remove all
stop_words to focus on the words with “content.”
data("stop_words", package = "tidytext")
tweet_words <- anti_join(tweet_words, stop_words, by = "word")
We can also just read in the already-wrangled data
tweet_words and proceed from here.
tweet_words <- read_rds(file="https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/Trump_tweet_words.Rds")
So what are the most commonly tweeted word stems?
tweet_words %>%
count(word) %>%
arrange(-n)
## # A tibble: 45,221 × 2
## word n
## <chr> <int>
## 1 trump 6269
## 2 president 4637
## 3 amp 4306
## 4 people 3475
## 5 country 2302
## 6 america 2211
## 7 time 1913
## 8 donald 1891
## 9 news 1842
## 10 democrats 1824
## # ℹ 45,211 more rows
Let’s plot the 20 most frequently used words in descending order using a barplot.
tweet_words %>%
count(word, sort = TRUE) %>%
head(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_bar(stat = "identity") +
ylab("Occurrences") +
scale_y_continuous(label=comma) +
coord_flip()
Interesting. But we want to know how twitter use changed, if at all,
over time and in response to being elected president. So let’s start by
defining a variation that is PostPresident defined to be
tweets after being projected as the winner of the 2016 Presidential
election.
NOTE: You could also look at other variation (e.g., years, pre/post certain events, etc.). There are lots of opportunities to expand/refine this! Try some!
tweet_words <- tweet_words %>%
mutate(PostPresident = Tweeting.date > as.Date("2016-11-03"))
To compare lets compare the top 10 word stems that were tweeted pre-presidency.
tweet_words %>%
filter(PostPresident == FALSE) %>%
select(word) %>%
count(word) %>%
top_n(10, wt= n) %>%
arrange(-n)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 trump 4192
## 2 amp 2218
## 3 president 1757
## 4 donald 1648
## 5 people 1265
## 6 obama 1178
## 7 america 1110
## 8 run 1036
## 9 dont 978
## 10 time 949
And the top 10 stems tweeted post-presidency.
tweet_words %>%
filter(PostPresident == TRUE) %>%
select(word) %>%
count(word) %>%
top_n(10, wt= n) %>%
arrange(-n)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 president 2880
## 2 people 2210
## 3 amp 2088
## 4 trump 2077
## 5 democrats 1736
## 6 news 1570
## 7 country 1357
## 8 fake 1265
## 9 america 1101
## 10 american 1089
Putting together in a nicer table using group_by.
tweet_words %>%
group_by(PostPresident) %>%
select(word) %>%
count(word) %>%
top_n(10, wt= n) %>%
arrange(-n) %>%
summarise(top_words = str_c(word, collapse = ", "))
## # A tibble: 2 × 2
## PostPresident top_words
## <lgl> <chr>
## 1 FALSE trump, amp, president, donald, people, obama, america, run, don…
## 2 TRUE president, people, amp, trump, democrats, news, country, fake, …
And now graphing them using a wordcloud. (Why are we
setting a seed?)
library(wordcloud)
set.seed(42)
tweet_words %>%
filter(PostPresident == FALSE) %>%
select(word) %>%
count(word) %>%
{ wordcloud(.$word, .$n, max.words = 100) }
tweet_words %>%
filter(PostPresident == TRUE) %>%
select(word) %>%
count(word) %>%
{ wordcloud(.$word, .$n, max.words = 100) }
But what about variation over time? Lots of events happened every year and looking at all tweets before 2016 compared to all of the tweets after Election Day 2016 may lose important nuance and variation. So let’s look at the 10 most frequently tweeted words each year.
tweet_words %>%
group_by(Tweeting.year) %>%
select(word) %>%
count(word) %>%
top_n(10, wt= n) %>%
arrange(-n) %>%
summarise(top_words = str_c(word, collapse = ", ")) %>%
knitr::kable()
| Tweeting.year | top_words |
|---|---|
| 2009 | donald, trump, httptinyurlcompqpfvm, champion, trumps, book, watch, david, dont, enter, happy, read, signed |
| 2010 | pm, apprentice, trump, nbc, miss, tonight, fantastic, night, tune, episode, hotel |
| 2011 | cont, interview, china, pm, america, trump, book, watch, debt, jobs |
| 2012 | cont, obama, amp, trump, china, interview, people, dont, time, discussing |
| 2013 | trump, amp, people, donald, obama, president, true, dont, time, love |
| 2014 | trump, president, amp, donald, run, obama, country, love, true, dont, vote |
| 2015 | trump, donald, president, amp, america, run, people, country, love, time |
| 2016 | hillary, trump, amp, america, people, clinton, crooked, cruz, vote, join |
| 2017 | amp, people, news, fake, america, president, tax, trump, country, american |
| 2018 | amp, people, president, trump, country, democrats, news, border, fake, time |
| 2019 | president, amp, democrats, trump, people, country, news, impeachment, border, fake |
| 2020 | president, trump, people, biden, joe, news, democrats, election, vote, american |
| 2021 | georgia, election, votes, people, th, dc, january, senators, vote, republican |
An now, how about by hour? What is on President Trump’s mind, on average, at every hour of the day?
# INSERT CODE HERE
Pushing ahead, we could also do hour by pre/post presidency (or year) to see how the content changed. Or perhaps how activity varies across parts of the day by creating periods of time based on the hours (e.g., “late-night”, “early morning”, “normal work-day”). Here is where you as a data-scientist can propose (and defend!) categorizations that are useful for understanding the variation in the data.
So far we have focused on frequency of word use, but another way to make this comparison is to look at the relative “odds” that each word is used pre/post presidency. After all, “Trump” is used by Trump both before and after his presidency so perhaps that is not a great indicator of content. We could instead consider the relative rate at which a word is used Post-Presidency relative to Pre-presidency.
We are going to count each word stem use pre and post-presidency,
then select only those words that were used at least 5 times, then
spread the data so that if a word appears Pre-Presidency
but not Post-Presidency (or visa-versa) we will create a matching word
with the filled in value of 0, then we are going to ungroup the data so
that the observation is now a word rather than a word-timing combination
(look to see how the tibble changes before and after the ungroup() by
running these code snippets separately to see). Then we are going to
mutate_each to compute the fraction of times a word is used
relative to all words (the . indicates the particular value
of each variable – note that we are adding a + 1 to each of
those values to avoid errors when taking the log later). We then compute
the ratio by computing the relative frequency of each word
used pre and post presidency and take the log of that ratio because of
extreme outliers before arranging the tibble in decreasing value of
ratio
So let’s compute the log odds ratio for each word pre and post presidency.
prepost_ratios <- tweet_words %>%
count(word, PostPresident) %>%
filter(sum(n) >= 5) %>%
spread(PostPresident, n, fill = 0) %>%
ungroup() %>%
mutate_each(funs((. + 1) / sum(. + 1)), -word) %>%
mutate(ratio = `TRUE` / `FALSE`) %>%
mutate(logratio = log(ratio)) %>%
arrange(-logratio)
Now let’s plot the top 15 most distinctive words (according to the log-ratio we just computed) that were tweeted before and after Trump was elected president.
prepost_ratios %>%
group_by(logratio > 0) %>%
top_n(15, abs(logratio)) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Post-President/Pre-President log ratio") +
scale_fill_manual(name = "", labels = c("President", "Pre-President"),
values = c("red", "lightblue"))
You could look at other splits. Pre-post his first impeachment? 2016 versus 2017? Note that the log-ratio is a comparison of a binary condition.
How else can we summarize/describe data? Cluster Analysis via
kmeans?
But using what data? Should we focus on the number of words being used? The proportion of times a word is used in a particular document? Or some other transformation that tries to account for how frequently a word is used in a particular document relative to how frequently it is used in the overall corpus?
We are going to use the text analysis function
bind_tf_idf that will take a document-term matrix and
compute the fraction of times each word is used in each document
(tf = “term frequency”). It also computes a transformation
called tf-idf that balances how frequently a word is used relative to
its uniqueness in a document.
For word w in document d we can compute the
tf-idf using: \[
tf-idf(w,d) = tf(w,d) \times log \left( \frac{N}{df(w)}\right)
\] where tf is the term frequency (word count/total
words), df(w) is the number of documents in the corpus that
contain the word, and N is the number of documents in the
corpus. The inverse-document-frequency idf for each word
w is therefore the number of documents in the corpus
N over the number of documents containing the word.
NOTE: what is a document? In theory, a document is a tweet. However,
tweets are so short that most words only appear once. Furthermore, there
are a LOT of tweets written by Trump over his time on the platform.
Instead, we will treat a DAY (Tweeting.date) as our
“document”. Be aware what this implies! We are assuming that Trump’s
tweets written on a given day are about the same thing. This is
obviously not always true, but it is a reasonable assumption to start
with.
So let us create a new document-term-matrix object that also includes
the tf, idf and tf_idf associated
with each word.
# Create the dtm with Tweeting.date as the "document".
dtm <- tweet_words %>%
count(Tweeting.date,word) %>%
group_by(word) %>%
mutate(tot_n = sum(n)) %>%
ungroup() %>%
filter(tot_n > 20) # Drop words that appear less than 20 total time across the entire data
require(tidytext)
dtm.tfidf <- bind_tf_idf(tbl = dtm, term = word, document = Tweeting.date, n = n) # Calculate TF-IDF
dtm.tfidf %>%
select(word,tf_idf) %>%
distinct() %>%
arrange(-tf_idf) %>%
slice(1:10)
## # A tibble: 10 × 2
## word tf_idf
## <chr> <dbl>
## 1 gma 5.21
## 2 generation 4.90
## 3 dead 3.19
## 4 apprentice 2.58
## 5 weekly 2.40
## 6 appearance 2.26
## 7 answers 2.11
## 8 sounds 2.10
## 9 football 2.01
## 10 defeat 1.90
So kmeans took a matrix where each column was a
different variable that we were interested in using to characterize
patterns but the data we have is arranged in a
one-term-per-document-per-row. To transform the data into the format we
require we need to “recast” the data so that each word is a separate
variable – meaning that the number of variables is the number of of
unique word stems.
castdtm <- cast_dtm(data = dtm.tfidf, document = Tweeting.date, term = word, value = tf_idf)
And now we can run \(k\)-means on
this object! How many centers (clusters or \(K\)) should we choose? This all depends on
how many different things we think Trump talks about. For now, we will
just use 50. However, recall that we should create an “elbow” plot like
we did in the previous lecture, and choose \(k\) based on where the reductions in errors
are smaller. (NB: this may take some time on your computer if it is low
on memory.)
set.seed(42)
km_out <- kmeans(castdtm,
centers = 50, # 50 "topics"
nstart = 5) # Set nstart = 5 to make sure this doesn't take forever!
So how many documents are associated with each cluster?
table(km_out$cluster)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 12 1 1 4 16 36 156 1 1 9 2 34 3066 2 1 1
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 1 5 1 4 1 1 15 1 1 1 1 4 8 4 3 2
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 1 3 1 12 5 3 1 7 4 1 3 17 1 2 10 1
## 49 50
## 17 7
So let’s tidy it up to see the centroids – here mean frequency– associated with each word in each cluster.
require(broom)
tidy(km_out)
## # A tibble: 50 × 3,180
## david donald late letterman list night ten tonight top
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.00970 0 0 0 0.0783 0 0.0898 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0.00861 0 0.0297 0
## 6 0 0.250 0 0 0.0209 0.0170 0.0201 0 0.0112
## 7 0.000647 0.00121 0.00114 0 0.000271 0.00859 0 0.00427 0.00160
## 8 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0.607
## 10 0 0.0559 0 0 0 0 0 0 0
## # ℹ 40 more rows
## # ℹ 3,171 more variables: trump <dbl>, tune <dbl>, watch <dbl>,
## # apprentice <dbl>, book <dbl>, celebrity <dbl>, champion <dbl>,
## # discuss <dbl>, morning <dbl>, tomorrow <dbl>, view <dbl>, finale <dbl>,
## # financial <dbl>, funny <dbl>, learned <dbl>, post <dbl>, build <dbl>,
## # fired <dbl>, id <dbl>, ive <dbl>, miss <dbl>, usa <dbl>, walls <dbl>,
## # discussing <dbl>, interview <dbl>, listen <dbl>, sense <dbl>, …
Very hard to summarize given that there are 3180 columns! (Good luck trying to graph that!)
How can we summarize? Want to gather() to convert the
data to long form.
km_out_tidy <- tidy(km_out) %>%
gather(word,mean_tfidf,-size,-cluster,-withinss) %>% # Convert to long data (don't add "size", "cluster", and "withinss" to the new "word" column! These are part of the tidy() output!)
mutate(mean_tfidf = as.numeric(mean_tfidf)) # Calculate average TF-IDF
km_out_tidy %>%
filter(word=="apprentice") %>%
print(n=50)
## # A tibble: 50 × 5
## size withinss cluster word mean_tfidf
## <int> <dbl> <fct> <chr> <dbl>
## 1 12 12.6 1 apprentice 0
## 2 1 0 2 apprentice 0
## 3 1 0 3 apprentice 2.58
## 4 4 7.12 4 apprentice 0
## 5 16 12.4 5 apprentice 0
## 6 36 43.0 6 apprentice 0
## 7 156 54.5 7 apprentice 0.00285
## 8 1 0 8 apprentice 0
## 9 1 0 9 apprentice 0
## 10 9 9.95 10 apprentice 0
## 11 2 1.90 11 apprentice 0.215
## 12 34 28.1 12 apprentice 0.300
## 13 3066 720. 13 apprentice 0.00269
## 14 2 0.316 14 apprentice 0
## 15 1 0 15 apprentice 0
## 16 1 0 16 apprentice 0
## 17 1 0 17 apprentice 0
## 18 5 2.87 18 apprentice 0.0368
## 19 1 0 19 apprentice 0
## 20 4 6.52 20 apprentice 0
## 21 1 0 21 apprentice 0
## 22 1 0 22 apprentice 0
## 23 15 24.8 23 apprentice 0
## 24 1 0 24 apprentice 0
## 25 1 0 25 apprentice 0
## 26 1 0 26 apprentice 0
## 27 1 0 27 apprentice 0
## 28 4 2.59 28 apprentice 0
## 29 8 11.5 29 apprentice 0
## 30 4 2.26 30 apprentice 0
## 31 3 2.42 31 apprentice 0
## 32 2 1.99 32 apprentice 0
## 33 1 0 33 apprentice 0
## 34 3 4.43 34 apprentice 0
## 35 1 0 35 apprentice 0
## 36 12 10.5 36 apprentice 0.0134
## 37 5 4.12 37 apprentice 0
## 38 3 4.75 38 apprentice 0.215
## 39 1 0 39 apprentice 0
## 40 7 3.78 40 apprentice 0
## 41 4 3.08 41 apprentice 0.130
## 42 1 0 42 apprentice 0
## 43 3 5.02 43 apprentice 0
## 44 17 13.9 44 apprentice 0.297
## 45 1 0 45 apprentice 0
## 46 2 1.48 46 apprentice 0
## 47 10 5.18 47 apprentice 0
## 48 1 0 48 apprentice 0
## 49 17 17.1 49 apprentice 0.00798
## 50 7 9.56 50 apprentice 0.0409
This tells us that the word “apprentice” is only weakly associated with topic (“cluster”) 7 (0.00285), but is strongly associated with topic 3 (2.58). We can also see how many documents (i.e., days) are associated with each topic. Topic 2 only appears once, while topic 13 appears 3,066 times.
Let’s try plotting just the third topic to better understand what it
is about. To do this, we want to look at the top 10 highest scoring
words according to mean_tfidf.
km_out_tidy %>%
filter(cluster %in% 13) %>%
group_by(cluster) %>%
arrange(-mean_tfidf) %>%
slice(1:10)
## # A tibble: 10 × 5
## # Groups: cluster [1]
## size withinss cluster word mean_tfidf
## <int> <dbl> <fct> <chr> <dbl>
## 1 3066 720. 13 amp 0.0104
## 2 3066 720. 13 trump 0.0102
## 3 3066 720. 13 president 0.00920
## 4 3066 720. 13 america 0.00804
## 5 3066 720. 13 donald 0.00786
## 6 3066 720. 13 obama 0.00698
## 7 3066 720. 13 people 0.00696
## 8 3066 720. 13 country 0.00663
## 9 3066 720. 13 vote 0.00657
## 10 3066 720. 13 hillary 0.00645
This appears to be about domestic politics and elections! For those who are familiar with Trump’s time in office, he would frequently talk about former President Obama and his 2016 opponent Hillary Clinton.
We can turn this into a plot for easier visualization (and look at multiple topics at once).
km_out_tidy %>%
filter(cluster %in% 1:9) %>%
group_by(cluster) %>%
arrange(-mean_tfidf) %>%
slice(1:10) %>%
ggplot(aes(x = mean_tfidf,y = reorder(word,mean_tfidf),
fill = factor(cluster))) +
geom_bar(stat = 'identity') +
facet_wrap(~cluster,scales = 'free') +
labs(title = 'k-means Clusters',
subtitle = 'Clustered by TF-IDF',
x = 'Centroid',
y = NULL,
fill = 'Cluster ID')
We can also assign each topic to the “documents” it is associated
with. To do this, we need to create a new dataset as follows: - Get the
document names from the castdtm object. These are stored in the
dimnames variable under Docs. - Get the
cluster (topics) for each document from the km_out object.
These are stored in the cluster variable. - R
automatically converted the document names to character representations
of the numeric version of the data. We need to convert these back to
dates using as.Date() combined with
as.numeric(). Since these are stored as raw numbers, we
must specify the origin = "1970-01-01" in order for the
as.Date() function to work properly.
doc_cluster <- data.frame(Tweeting.date = castdtm$dimnames$Docs, # Get the document names from the castdtm object.
cluster = km_out$cluster) %>% # Get the topics from thr km_out object
as_tibble() %>%
mutate(Tweeting.date = as.Date(as.numeric(Tweeting.date),origin = '1970-01-01')) # Convert the document names back to date formats
doc_cluster
## # A tibble: 3,492 × 2
## Tweeting.date cluster
## <date> <int>
## 1 2009-05-04 18
## 2 2009-05-05 12
## 3 2009-05-08 18
## 4 2009-05-12 6
## 5 2009-05-13 10
## 6 2009-05-14 6
## 7 2009-05-15 28
## 8 2009-05-16 6
## 9 2009-05-17 6
## 10 2009-05-18 6
## # ℹ 3,482 more rows
So topics 18 and 12 were the focus of Trump’s tweets on his first two days on the platform, followed by multiple days where he emphasized topic 6.
Let’s plot these to make it easier to see patterns.
doc_cluster %>%
ggplot(aes(x = Tweeting.date,y = factor(cluster))) +
geom_tile() +
labs(title = 'Topics by Date',
subtitle = 'Donald Trump Twitter Account',
x = 'Date Tweeted',
y = 'Topic Number')
There are basically 3 (maybe 4?) core topics from our data. Topic 7 is Trump’s main focus prior to 2016 when he starts campaigning. Then it shifts to topic 13 throughout the campaign and during his presidency. Let’s look at these two topics!
km_out_tidy %>%
filter(cluster %in% c(7,13)) %>%
group_by(cluster) %>%
arrange(-mean_tfidf) %>%
slice(1:10) %>%
mutate(topic = ifelse(cluster == 13,'2: Campaign & Pres.','1: Pre-campaign')) %>%
ggplot(aes(x = mean_tfidf,y = reorder(word,mean_tfidf),
fill = factor(cluster))) +
geom_bar(stat = 'identity') +
facet_wrap(~topic,scales = 'free') +
labs(title = 'k-means Clusters',
subtitle = 'Clustered by TF-IDF',
x = 'Centroid',
y = NULL,
fill = 'Cluster ID')
Amazing! Using just \(k\)-means with a bag-of-words of Trump’s tweets, we have a clear idea of what he talked about during different periods!
We are going to start from the prepared
Trump_tweet_words.Rds file from last time. Let’s load it
in, along with a bunch of useful packages for text analysis.
require(tidyverse)
require(tidytext)
require(scales)
tweet_words <- read_rds(file="https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/Trump_tweet_words.Rds") %>% # Note you can add data manipulations directly to the code that opens the file!
mutate(PostPresident = Tweeting.date > "2016-11-03")
So far we have focused on frequency of word use, but another way to make this comparison is to look at the relative “odds” that each word is used pre/post presidency. After all, “Trump” is used by Trump both before and after his presidency so perhaps that is not a great indicator of content. We could instead consider the relative rate at which a word is used Post-Presidency relative to Pre-presidency.
We are going to count each word stem used pre and post-presidency,
then select only those words that were used at least 5 times, then
spread the data so that if a word appears Pre-Presidency
but not Post-Presidency (or visa-versa) we will create a matching word
with the filled in value of 0, then we are going to ungroup the data so
that the observation is now a word rather than a word-timing combination
(look to see how the tibble changes before and after the ungroup() by
running these code snippets separately to see). Then we are going to
mutate_each to compute the fraction of times a word is used
relative to all words (the . indicates the particular value
of each variable – note that we are adding a + 1 to each of
those values to avoid errors when taking the log later). We then compute
the ratio by computing the relative frequency of each word
used pre and post presidency and take the log of that ratio because of
extreme outliers before arranging the tibble in decreasing value of
ratio.
So let’s compute the log odds ratio for each word pre and post presidency.
prepost_ratios <- tweet_words %>%
count(word, PostPresident) %>%
filter(sum(n) >= 5) %>%
spread(PostPresident, n, fill = 0) %>%
ungroup() %>%
mutate_each(funs((. + 1) / sum(. + 1)), -word) %>%
mutate(ratio = `TRUE` / `FALSE`) %>%
mutate(logratio = log(ratio)) %>%
arrange(-logratio)
Now let’s plot the top 15 most distinctive words (according to the log-ratio we just computed) that were tweeted before and after Trump was elected president.
prepost_ratios %>%
group_by(logratio > 0) %>%
top_n(15, abs(logratio)) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Post-President/Pre-President log ratio") +
scale_fill_manual(name = "", labels = c("President", "Pre-President"),
values = c("red", "lightblue"))
You could look at other splits. Pre-post his first impeachment? 2016 versus 2017? Note that the log-ratio is a comparison of a binary condition.
Everything we have done so far is “basic” in the sense that we are looking at frequencies and proportions. We have not done anything particularly fancy (apart from perhaps defining the log odds-ratio but even that is just applying a function to our data).
A prominent tool used to analyze text is called “sentiment analysis.” Unlike clustering algorithms that we discussed earlier that were unsupervised, “sentiment analysis” is an analysis that classifies the meaning/sentiment of text based on a dictionary of words that are assumed to be true. That is, we are using an object with assumed meaning to learn about the characteristics of a text in terms of concepts that are predefined and predetermined.
Sentiment analysis sounds like it is very complex, and it is because of the complexity of language and how the meaning/sentiment of words can change based on context.
Analyzing sentiment requires using a dictionary of predefined sentiment and then using the frequency (or a similar measure) of words with sentiment to classify a text. If we have information on the sentiment of a tweet (perhaps via crowdsourcing) then we can use the methods of prior classes to try to determine how close other tweets are to the tweets that have been identified as belonging to each sentiment – i.e., we can use features of a tweet to classify it given the relationship of known tweets.
If we do not have this predefined tweet-level sentiment, we can characterize sentiment by counting the number of times a set of predefined words are used and then using the resulting distributions to interpret the “sentiment” of the text. Note that it is always preferable to use the former to account for the contextual meaning of language, but characterizing the frequency of sentiments can also be of interest.
There are several dictionaries of sentiment, but we are going to use the NRC Word-Emotion Association lexicon created by, and published in Saif M. Mohammad and Peter Turney. (2013), “Crowdsourcing a Word-Emotion Association Lexicon,” Computational Intelligence, 29(3): 436-465. This is available from the tidytext package, which associates words with 10 sentiments: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
As an aside, the way the sentiment was created was by showing a bunch
of people a series of words and then asking them what emotion they
associated with each word. The responses of regular people were then
aggregated to determine whether each word was associated with each
emotion (by effectively looking at the extent to which individuals
associated the same emotion with each word). This is what we mean by
“crowd-sourcing” – using people to collect data for us. Note that we
could create our own DSCI 1000 sentiment index. All we would need to do
is to have each of you indicate the emotion(s) you associate with a
series of words. For example, when you see ggplot do you
feel positive, negative,
anger, anticipation,
disgust, fear, joy,
sadness, surprise, or
trust? (Multiple emotions are allowed.)
So let’s load the NRC sentiment library in and take a look at what it
looks like by calling nrc to the console. (NB: you may need
to answer a question in your console after running the chunk below.
Answer “yes” by typing 1 in the console and hitting ENTER. If you are
unable to knit the HW it is probably because you have to run this chunk
first.)
library(tidytext)
library(textdata)
nrc <- get_sentiments("nrc")
nrc
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
Let’s get overall sentiment as a fraction of words used in tweets
grouped by PostPresidency (could also group by
Tweeting.year or Tweeting.hour).
Start by defining the relative frequency of each word being used pre/post presidency conditional on being tweeted at least 5 times.
word_freq <- tweet_words %>%
group_by(PostPresident) %>%
count(word) %>%
filter(sum(n) >= 5) %>%
mutate(prop = prop.table(n))
Now we use inner_join to select only the words in
tweet_words that are contained in the NRC sentiment
analysis word list. Note that inner_join keeps only the
observations that are common to both tweet_words and
nrc so it is going to keep just the words that have an
associated sentiment. Note that if we stem the word tokens this would be
problematic!
word_freq_sentiment <- word_freq %>%
inner_join(nrc, by = "word")
Note that the proportion prop that we calculated above
will no longer sum to 1 because: 1. we dropped words outside of the NRC
word list with the inner_join, 2. each word can appear in
multiple sentiments. That said, the proportion is still meaningful as a
measure of the relative importance of each word, even if the
interpretation of the value is somewhat problematic.
Now let’s plot the top most typically used words in each sentiment.
word_freq_sentiment %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
facet_wrap(~ sentiment, scales = "free", nrow = 3) +
geom_bar(stat = "identity") +
coord_flip()
One way to measure sentiment is simply the number of positive sentiments - the number of negative sentiments.
Let’s go back to our original tibble where each observation was a
word, use inner_join to extract sentiments and then create
a new tibble tweet_sentiment_summary that summarizes the
sentiment of each tweet.
tweet_sentiment <- tweet_words %>%
inner_join(nrc, by = "word")
tweet_sentiment_summary <- tweet_sentiment %>%
group_by(PostPresident, sentiment) %>%
count(document,sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
Note the use of pivot_wider here. This transforms a
“long” tibble (many rows) into a “wide” tibble (many columns). Now each
observation is a tweet (instead of a word in a tweet).
If we wanted to see how many total words were used in the all of the tweets we observed pre-post presidency we can summarize.
tweet_sentiment_summary %>%
group_by(PostPresident) %>%
mutate(ntweet = 1) %>%
summarize(across(-document, sum))
## # A tibble: 2 × 13
## PostPresident anger anticipation disgust fear joy negative positive sadness
## <lgl> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 FALSE 8125 13157 5332 7979 12274 14888 27152 7952
## 2 TRUE 13900 13695 8922 14069 10563 25573 31122 10984
## # ℹ 4 more variables: surprise <int>, trust <int>, sentiment <int>,
## # ntweet <dbl>
We can plot the distribution of sentiment scores pre/post presidency
as follows. Note that the position controls where the bars
for each fill are placed. (Try running it without
position to see the default.) The aes of
geom_bar is defined so as to plot the proportion rather
than the frequency. Here we have told ggplot to calculate the proportion
for us.
tweet_sentiment_summary %>%
ggplot(aes(x = sentiment, fill=PostPresident)) +
geom_bar(aes(y = ..prop..), position=position_dodge()) +
scale_y_continuous(labels = percent) +
labs(y= "Proportion of Tweets", x = "Sentiment Score: Positive - Negative", fill="Is President?")
As before, we could also use facet_wrap here. How to
think about nrow here. Should it be 1 or 2? What is the comparison you
want to make - comparing across x-axis or y-axis? Note also that we
chose scales="fixed" this time. Why is this important?
tweet_sentiment_summary %>%
ggplot(aes(x = sentiment)) +
facet_wrap(~ PostPresident, scales = "fixed", nrow = 2) +
geom_bar(aes(y = ..prop..)) +
scale_y_continuous(labels = percent) +
labs(y= "Proportion of Tweets", x = "Sentiment Score: Positive - Negative")
Can also see how it varies by hour. To do this we need to go back to
the original tibble to group_by
Tweeting.hour.
tweet_sentiment %>%
group_by(PostPresident,Tweeting.hour,sentiment) %>%
count(document,sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative) %>%
summarize(AvgSentiment = mean(sentiment)) %>%
ggplot(aes(y = AvgSentiment, x= as.integer(Tweeting.hour), color=PostPresident)) +
geom_point() +
labs(x = "Tweeting Hour (EST)", y = "Average Tweet Sentiment: Positive - Negative", color = "Is President?") +
scale_x_continuous(n.breaks = 13, limits = c(0,23))
We can also dive in deeper and focus on particular sentiments. What is the distribution of the number of “angry” words being used in each tweet? Was Trump getting “angrier” over time?
tweet_sentiment_summary %>%
ggplot(aes(x = anger, fill=PostPresident)) +
geom_bar(aes(y = ..prop..), position=position_dodge()) +
scale_y_continuous(labels = percent) +
labs(y= "Proportion of Tweets", x = "Number of `Angry` Words in Tweet", fill="Is President?")
Or angrier based on the time of day post-presidency?
tweet_sentiment %>%
filter(PostPresident==TRUE, sentiment == "anger") %>%
group_by(Tweeting.hour) %>%
count(document,sentiment) %>%
summarize(AvgAnger = mean(n)) %>%
ggplot(aes(y = AvgAnger, x= as.integer(Tweeting.hour))) +
geom_point() +
labs(x = "Average Tweeting Hour (EST)", y = "Avg. Number of Angry Words") +
scale_x_continuous(n.breaks = 13, limits = c(0,23))
And how about which words are most distinctively used in each sentiment (using log-odds ratio)?
prepost_ratios %>%
inner_join(nrc, by = "word") %>%
filter(!sentiment %in% c("positive", "negative")) %>%
mutate(sentiment = reorder(sentiment, -logratio),
word = reorder(word, -logratio)) %>%
group_by(sentiment) %>%
top_n(10, abs(logratio)) %>%
ungroup() %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
facet_wrap(~ sentiment, scales = "free", nrow = 2) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "", y = "Post / Pre log ratio") +
scale_fill_manual(name = "", labels = c("Post", "Pre"),
values = c("red", "lightblue"))
Now you can blow the doors off of this!
How does the sentiment change over time? At various times of the day?
Note also that this is very basic stuff. Instead of defining sentiment at the token level and trying to aggregate up we can define it at the tweet level and then use the methods we talked about last time to try to classify tweet-level sentiments by determining how close each tweet is to the tweets classified according to each sentiment. This kind of “supervised” learning would require a tweet-level measure of sentiment that we could then predict using features as before (e.g., words, time of day, date). To do so we could build a prediction model and predict the tweet.
We can also try to use the most-frequently used sentiment to classify each tweet and see the most frequent “sentiment” associated with each tweet. If, for example, we were to classify the sentiment of each tweet using the modal sentiment (i.e., the most frequently appearing sentiment) we would get the following.
tweet_sentiment %>%
filter(sentiment != "positive" & sentiment != "negative") %>%
group_by(document) %>%
count(sentiment) %>%
top_n(1,n) %>% # select the most frequently appearing sentiment
group_by(sentiment) %>%
count() %>%
ungroup() %>%
mutate(PctSentiment = n/sum(n))
## # A tibble: 8 × 3
## sentiment n PctSentiment
## <chr> <int> <dbl>
## 1 anger 9403 0.112
## 2 anticipation 12180 0.145
## 3 disgust 5454 0.0651
## 4 fear 9562 0.114
## 5 joy 9703 0.116
## 6 sadness 8057 0.0962
## 7 surprise 8477 0.101
## 8 trust 20953 0.250
What do you think about this?
Now let’s graph the proportion of tweets according to the modal sentiment over time.
tweet_sentiment %>%
filter(sentiment != "positive" & sentiment != "negative") %>%
group_by(Tweeting.year,document) %>%
count(sentiment) %>%
top_n(1,n) %>%
group_by(Tweeting.year,sentiment) %>%
count() %>%
ungroup(sentiment) %>%
mutate(PctSentiment = n/sum(n)) %>%
ggplot(aes(x=Tweeting.year, y = PctSentiment, color = sentiment)) +
geom_point() +
labs(x = "Year", y = "Pct. Tweets with Modal Sentiment", color = "Model Tweet Sentiment") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1))
You can literally do a thousand descriptive things with sentiment!